1 Abstract


Multilayer neural networks, trained by the backpropagation through time algorithm (BPTT), have 
been used successfully as state-feedback controllers for nonlinear terminal control problems. Current
BPTT techniques, however, are not able to deal systematically with open final-time situations such 
as minimum-time problems. Two approaches which extend BPTT to open final-time problems 
are presented. In the first, a neural network learns a mapping from initial-state to time-to-go. 
In the second, the optimal number of steps for each trial run is found using a line-search. Both 
methods are derived using Lagrange multiplier techniques. This theoretical framework is used to 
demonstrate that the derived algorithms are direct extensions of forward/backward sweep methods
used in N-stage optimal control. The two algorithms are tested on a Zermelo problem and the 
resulting trajectories compare favorably to optimal control results. 
2 Introduction


The use of neural networks as controllers for dynamic systems is currently the subject of much 
interest. Controllers for linear systems can be designed using well-established techniques, but no 
general design approaches yet exist for the larger class of nonlinear systems. A neural network can 
be a useful alternative tool for the synthesis of nonlinear controllers. The feasibility of training 
neural controllers has, moreover, been demonstrated by numerous applications. 

As distinguished by Bryson and Ho [1], there are two main classes of controllers: regulators 
and terminal controllers. A regulator maintains the state of the system about some known reference.
It accomplishes this despite external disturbances and internal uncertainties. Narendra and
Parthasarathy, among others, have analyzed neural network regulator structures [2,3], training their 
networks using Williams and Zipser’s dynamic backpropagation algorithm [4]. Unlike a regulator, 
a terminal controller drives the plant to some final state while maintaining acceptable state and 
control values along the trajectory. Whereas a regulator continues its task indefinitely, a terminal 
controller stops when the desired final state is reached. One example of a terminal controller is the 
truck-backer of Nguyen and Widrow [5] in which a network was trained, using a variant of Werbos’s 
backpropagation through time (BPTT) algorithm [6], to implement a state-feedback control law. 

An important class of terminal controllers is that in which the controller does not know, and 
must optimize, the number of steps along the trajectory. However, current BPTT techniques for 
terminal controller design do not provide a systematic way of incorporating the time elapsed along 
a trajectory as part of the cost function. As a result, it is not practical to solve problems such 
as minimum-time control. On the other hand, such “open final-time” problems have been dealt 
with for several decades using classical optimal control methods [1]. However, as I point out in 
section 3, these optimal control techniques typically find a set of open-loop control vectors which 
are valid only along a single trajectory. This is a limitation, as often control over many trajectories 
is needed. Such control can be accomplished, if full state information is available, by capturing the 
optimal state-feedback control mapping over the desired range of state-space. 

Dynamic programming [7,8] provides one way of computing this optimal feedback control; for 
problems with many dimensions, however, the computation and storage requirements of dynamic 
programming are prohibitive. A second, less problematic, technique involves the precomputation 
of a number of nominal optimal paths and the subsequent use of second variation methods to 
find optimal solutions near one of the precomputed trajectories [1]. However, this requires both 
deciding upon a good representative set of nominal trajectories as well as explicitly storing the set 
of computed control vectors. 

Sigmoidal feedforward neural networks with 2 layers of neural elements are capable of approxi- 
mating any sufficiently well-behaved function, given that they contain a sufficient number of nodes 
in the hidden layer [9,10]. In consequence, they can be used to approximate the control law with- 
out explicitly storing control vectors over the state-space. A way of combining the state-feedback 
structure of a neural network controller with the open final-time methods of optimal control would 
permit the application of neural networks to the class of terminal control problems described above.
This paper derives two such extensions to BPTT using well-established Lagrange multiplier meth- 
ods. A similar theoretical framework for backpropagation has already been proposed by le Cun 
[11]. This paper extends the framework to describe a state-feedback control structure. 


3 Review of N-Stage Optimal Control 


Before considering the neural network problem, we first review N-stage optimal control methods for 
designing terminal controllers. A typical N-stage problem can be phrased as: choose the state-space
trajectory, $x = [x(0), \ldots, x(N)]$, and open-loop control sequence, $u = [u(0), \ldots, u(N-1)]$, for the
discrete-time plant, $f$, which minimize a cost function subject to the constraints of the system:

$$J^{\circ} = \min_{x,u} \left( \phi[x(N)] + \sum_{i=0}^{N-1} L[x(i),u(i)] \right)$$

$$x(0) = x_0, \quad \text{known initial condition} \tag{1}$$

$$x(i+1) = f(x(i),u(i)), \quad i = 0,\ldots,N-1. \tag{2}$$

The key points to observe are 1) the solution is found for a single trajectory, 2) the problem 
requires explicitly finding the open-loop control vectors at each increment in time. This problem 
can be solved by converting it into a two point boundary value problem (TPBVP). To do this, the 
plant-update equation is first adjoined to the cost function using a Lagrange multiplier sequence 
or adjoint vector sequence, $\lambda$:



$$J = \phi[x(N)] + \sum_{i=0}^{N-1} \left( L[x(i),u(i)] + \lambda(i+1)^T \left( f_i - x(i+1) \right) \right).$$

For notational convenience, a Hamiltonian sequence $H_i$ is usually defined as

$$H_i = L[x(i),u(i)] + \lambda(i+1)^T f(x(i),u(i)), \quad i = 0,\ldots,N-1.$$


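Substituting this $H_i$ into $J$, rearranging terms, and considering differential changes in $J$ due to
changes in $x$ and $u$ gives

$$dJ = \left( \frac{\partial \phi}{\partial x(N)} - \lambda(N)^T \right) dx(N) + \frac{\partial H_0}{\partial x(0)}\, dx(0) + \sum_{i=1}^{N-1} \left( \frac{\partial H_i}{\partial x(i)} - \lambda(i)^T \right) dx(i) + \sum_{i=0}^{N-1} \frac{\partial H_i}{\partial u(i)}\, du(i).$$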


Since $x(0)$ is fixed, $dx(0) = 0$. By the Kuhn-Tucker conditions [12], in order to have optimal x(i)
and u(i), we must have the gradient vector $\nabla J = 0$. Thus, we need $dJ = 0$ for all choices of
$dx(1), \ldots, dx(N)$, and $du(0), \ldots, du(N-1)$. For this to hold, we must have


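$$\lambda(N)^T = \frac{\partial \phi}{\partial x(N)} \tag{3}$$

$$\lambda(i)^T = \frac{\partial H_i}{\partial x(i)} = \frac{\partial L_i}{\partial x(i)} + \lambda(i+1)^T \frac{\partial f_i}{\partial x(i)}, \quad i = 1,\ldots,N-1 \tag{4}$$

$$0 = \frac{\partial H_i}{\partial u(i)} = \frac{\partial L_i}{\partial u(i)} + \lambda(i+1)^T \frac{\partial f_i}{\partial u(i)}, \quad i = 0,\ldots,N-1. \tag{5}$$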
The terminal value of the Lagrange multiplier sequence $\lambda$ is given by equation 3. Earlier values
of $\lambda$ are then found using the iterative procedure given by equation 4. In order to have optimal
u(i) and x(i), the optimality condition (eqn. 5) must be satisfied. These equations define a TPBVP
which typically must be solved numerically. One numerical method [1] consists of guessing an initial
control sequence, $u_0(i)$, and then making many trial runs of the system (eqns. 1-2). After each run,
the terminal error is swept backward (eqns. 3-4) and the control vector at each time step is updated
by gradient descent:

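$$u_{k+1}(i) = u_k(i) - \mu \left( \frac{\partial L_i}{\partial u(i)} + \lambda(i+1)^T \frac{\partial f_i}{\partial u(i)} \right)^T, \quad i = 0,\ldots,N-1. \tag{6}$$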
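The sweep procedure just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the routine used in the paper: the scalar plant x(i+1) = a x(i) + b u(i), the costs L = (r/2) u² and φ = (q/2)(x(N) − x_d)², and every name and default value in the code are hypothetical choices made only so that the example runs.

```python
def sweep_gradient_descent(x0, n_steps, iters=500, mu=1.0,
                           a=1.0, b=0.1, r=0.01, q=1.0, x_d=1.0):
    """Forward/backward sweep on an illustrative scalar plant.

    Assumed plant: x(i+1) = a*x(i) + b*u(i)
    Assumed costs: L = 0.5*r*u^2, phi = 0.5*q*(x(N) - x_d)^2
    """
    u = [0.0] * n_steps                    # initial open-loop control guess
    for _ in range(iters):
        # Forward sweep: run the plant from the known initial state.
        x = [x0]
        for i in range(n_steps):
            x.append(a * x[i] + b * u[i])
        # Terminal multiplier: lambda(N) = dphi/dx(N).
        lam = q * (x[n_steps] - x_d)
        # Backward sweep: lam holds lambda(i+1) when stage i is processed.
        for i in reversed(range(n_steps)):
            grad_u = r * u[i] + lam * b    # dL/du(i) + lambda(i+1) * df/du(i)
            u[i] -= mu * grad_u            # gradient-descent control update
            lam = lam * a                  # lambda(i) = dL/dx + lambda(i+1) * df/dx, with dL/dx = 0
    # Final forward pass so the returned states match the returned controls.
    x = [x0]
    for i in range(n_steps):
        x.append(a * x[i] + b * u[i])
    return u, x


if __name__ == "__main__":
    controls, states = sweep_gradient_descent(0.0, 10)
    print(round(states[-1], 4))
```

Note that the result is a list of open-loop controls valid for one initial state only, which is the limitation the neural network controller of section 4 is meant to remove.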


This derivation assumed that the number of time-steps was known, an assumption which is also 
implicitly made in BPTT. In many terminal control problems, the final time, $t_f$, is not known a
priori and an optimal selection of this final time must be made. To do this in optimal control, the
problem is typically solved in continuous time so that the number of control parameters does not
change as $t_f$ is changed. Thus we have:

$$J^{\circ} = \min_{x,u,t_f} \left( \phi[x(t_f),t_f] + \int_0^{t_f} L[x(t),u(t)]\,dt \right)$$

$$x(0) = x_0$$

$$\dot{x}(t) = f(x(t),u(t)).$$


This problem is similarly converted to a two-point boundary value problem with continuous forward 
and backward equations. In addition to requiring that an optimality condition similar to equation 5 
holds, this procedure requires that the transversality condition 


$$0 = \left( \frac{\partial \phi}{\partial t} + L + \lambda^T f \right)_{t=t_f} \tag{7}$$


be satisfied at the terminal point. This can be done by using gradient descent on $t_f$,¹ an approach
which motivates method 1 of section 4.

¹ One such routine, fcnopt, written by Bryson [13], numerically integrates the continuous time system using a fixed
number of steps. The integration time step, $\Delta t$, is then varied by gradient descent in order to effect a change on $t_f$.
This provides a way of varying $t_f$ without changing the number of control vectors. The routine is used for comparison
purposes in section 5.
4 Neural Network Terminal Controller


4.1 Optimal Control Formulation of Problem

Given this background, we can now describe the neural network control structure shown in figure 1. 
As before, the block $f$ is a discrete-time model of the dynamic plant with sampling interval $\Delta t$. The
model determines the next state of the plant given the current state $x(i) \in \Re^n$ and control
$u(i) \in \Re^m$. The block $g$ is a nonlinear state-feedback controller consisting of a multilayer feedforward
neural network with weight vector $\theta$. Of course, the state vector is assumed to be available to
the controller at each iteration. Although this paper focuses on multilayer networks, the algorithms 
derived are equally applicable to any parameterized mapping which is differentiable with respect 
to the input and parameter vectors. 

Figure 1: Feedback control loop with neural network controller (blocks: neural network, discrete plant model).

So that we can deal systematically with having unknown trajectory lengths, we define a “time-to-go”
function $N(x_0)$ which maps the initial state to the length of the associated trajectory. This
is analogous to the parameter $t_f$ described in section 3. Furthermore, in order to formulate the
optimization problem over many trajectories simultaneously, we assume that the initial state is a
discrete random vector² $X_0$ taking on values $x_0 \in \{x_0^1, \ldots, x_0^P\}$ with some probability mass function
$P(X_0 = x_0^p)$.

The choice of $N(x_0)$, $\theta$, and $x_0$ determines a state trajectory $x(x_0,\theta) = [x(0), \ldots, x(N(x_0))]$
and control sequence $u(x_0,\theta) = [u(0), \ldots, u(N(x_0)-1)]$. Given these sequences, we can define a
trajectory cost $J(x(x_0,\theta), u(x_0,\theta), N(x_0))$. We could then attempt to find the $\theta$ and $N(x_0)$ which
minimize the expected cost over all trajectories,

$$J^{\circ} = \min_{\theta, N(x_0)} \sum_{p=1}^{P} J\left( x(x_0^p,\theta), u(x_0^p,\theta), N(x_0^p) \right) P(X_0 = x_0^p), \tag{8}$$

² Justification of the derivations when $X_0$ is a continuous random vector is not presented due to the subtleties
involved in applying the Kuhn-Tucker conditions when the parameter vector is not in $\Re^n$.
while implicitly incorporating the system constraints by judicious application of the chain-rule as
the gradient $\nabla_\theta J$ is determined. If we disregard the minimization over $N(x_0)$ in equation 8 by
assuming some value of N for each trajectory, then this unconstrained problem is equivalent to
the usual formulation of BPTT. An alternative method is to rephrase equation 8 as a constrained
optimization problem with the system constraints made explicit. To do this, we assume that the
state and control sequences can be chosen independently of $\theta$ and then attempt to find the $x(x_0)$,
$u(x_0)$, $\theta$, and $N(x_0)$ which minimize the expected cost


$$J^{\circ} = \min_{x(x_0),\,u(x_0),\,\theta,\,N(x_0)} \sum_{p=1}^{P} J\left( x(x_0^p), u(x_0^p), N(x_0^p) \right) P(X_0 = x_0^p) \tag{9}$$

while also satisfying the system constraints,

$$x^p(0) = x_0^p \tag{10}$$

$$x^p(i+1) = f(x^p(i),u^p(i)), \quad i = 0,\ldots,N(x_0^p)-1, \quad p = 1,\ldots,P \tag{11}$$

$$u^p(i) = g(x^p(i),\theta), \quad i = 0,\ldots,N(x_0^p)-1, \quad p = 1,\ldots,P \tag{12}$$


along each trajectory. For notational convenience, we will use the expected value operator and
drop the explicit dependence of $x$ and $u$ on $x_0^p$.


4.2 Choice of Cost Function 


Equation 9 is a general form of the cost function. In this paper, a specific cost function is phrased 
as a Bolza problem: 


$$J = E\left[ \phi[x(N(x_0)), N(x_0)\Delta t] + \sum_{i=0}^{N(x_0)-1} L[x(i),u(i)] \right]. \tag{13}$$


The term L is used to incorporate cost that is accumulated along the trajectory. For example, a 
quadratic function of the control vector will minimize control effort. Path state-constraints can be 
implemented as part of L using barrier-function methods [12]. The terminal cost $\phi$ can be used
to implement soft terminal constraints. Note that hard terminal constraints which must be met 
exactly typically result in an ill-posed problem when this neural network structure is used. As an 
example cost function, consider 


$$\phi[x(N^{\circ}), N^{\circ}\Delta t] = Q_t N^{\circ}\Delta t + (x_d - x(N^{\circ}))^T Q_x (x_d - x(N^{\circ})). \tag{14}$$


This $\phi$ is linear in the final-time and is a quadratic function of the difference between the final state
and a desired state. The matrix $Q_x$, assumed to be symmetric and positive semi-definite, weights
the various state terms. Using this cost will minimize the trajectory time while approximately
maintaining the desired end condition. More complicated functions of final state and time can be
used as well. The soft terminal end-constraint on the state is adequately dealt with using standard
BPTT. The extensions presented in this paper are designed to incorporate terms such as $Q_t N^{\circ}\Delta t$
which involve final time. 
4.3 Solution of the Optimal Control Problem

We now derive two algorithms for solving the constrained optimization problem given by equa- 
tions 10-12 and equation 13 using the optimal control methods introduced in section 3. Although 
somewhat complex, the derivations yield very simple, intuitive extensions to BPTT. Both algo- 
rithms use the same stochastic gradient descent structure as BPTT. The training procedures con- 
sist of a sequence of trial runs of the system with initial states drawn independently from the 
distribution of $X_0$. Each run is followed by a computation of the terminal error, a backward sweep
of that error, and a weight update based on the instantaneous gradient from that one sweep. The 
new algorithms differ from BPTT in the selection of the stopping point for the forward sweep. 

4.3.1 Method 1: Explicit Model of “Time-To-Go” 

In the continuous open final-time problem of section 3, the final time $t_f$ was treated as a parameter
and gradient descent was used to update that parameter. Here we extend this idea to neural
network controllers by attempting to explicitly determine the optimal trajectory lengths. In order
to use gradient descent, however, we replace $N(x_0)$ with the continuous time-to-go function $t_f(x_0)$.
To permit this, we first compute the state history $x(0), \ldots, x(N)$ along each trajectory by
forward-iterating the system $N = \lfloor t_f(x_0)/\Delta t \rfloor$ steps using equations 10-12. We then simulate the effect of
running the system for one partial step of length $\alpha\Delta t = t_f(x_0) - N\Delta t$ and find $x(t_f)$ by linear
interpolation:


$$x(t_f) \approx (1-\alpha)\,x(N) + \alpha\,x(N+1) = (1-\alpha)\,x(N) + \alpha f(x(N),u(N)). \tag{15}$$


We also augment the cost function (eqn. 13) to reflect the cost incurred during this partial step: 


$$J = E\left[ \phi[x(t_f), t_f] + \alpha L[x(N),u(N)] + \sum_{i=0}^{N-1} L[x(i),u(i)] \right]. \tag{16}$$


The $x(x_0)$, $u(x_0)$, $\theta$, and $t_f(x_0)$ which minimize this cost function, subject to the system constraints
(eqns. 10-12), are then sought. To do this, we first adjoin both the plant constraints (eqn. 11) and
controller constraints (eqn. 12) to the cost function using two sets of Lagrange multiplier sequences,
$\lambda_f(x_0) = [\lambda_f(0), \ldots, \lambda_f(N), \lambda_f(t_f(x_0))]$ and $\lambda_g(x_0) = [\lambda_g(0), \ldots, \lambda_g(N)]$. Like $x$ and $u$, these are
ensembles of sequences indexed by $x_0$. The adjoined cost function becomes:


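$$J = E\Bigg[ \phi[x(t_f), t_f] + \alpha L[x(N),u(N)] + \lambda_f(t_f)^T \left( (1-\alpha)x(N) + \alpha f_N - x(t_f) \right) + \lambda_g(N)^T \left( g_N - u(N) \right)$$

$$\qquad + \sum_{i=0}^{N-1} \left\{ L[x(i),u(i)] + \lambda_f(i+1)^T \left( f_i - x(i+1) \right) + \lambda_g(i)^T \left( g_i - u(i) \right) \right\} \Bigg].$$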
An ensemble of Hamiltonian sequences $H_i(x_0)$ is now defined as

$$H_i = L[x(i),u(i)] + \lambda_f(i+1)^T f(x(i),u(i)) + \lambda_g(i)^T g(x(i),\theta), \quad i = 0,\ldots,N-1, \quad \forall x_0 \in \{x_0^1,\ldots,x_0^P\}$$

$$H_N = \alpha L[x(N),u(N)] + \lambda_f(t_f)^T \left( (1-\alpha)x(N) + \alpha f(x(N),u(N)) \right) + \lambda_g(N)^T g(x(N),\theta), \quad \forall x_0 \in \{x_0^1,\ldots,x_0^P\}.$$

Substituting the Hamiltonian sequences into $J$, we simplify the expression to

$$J = E\left[ \phi[x(t_f), t_f] + \left( H_N - \lambda_f(t_f)^T x(t_f) - \lambda_g(N)^T u(N) \right) + \sum_{i=0}^{N-1} \left( H_i - \lambda_f(i+1)^T x(i+1) - \lambda_g(i)^T u(i) \right) \right].$$

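Replacing $t_f$ with $(N+\alpha)\Delta t$ and rearranging terms gives

$$J = E\left[ \left( \phi[x(t_f), (N+\alpha)\Delta t] - \lambda_f(t_f)^T x(t_f) \right) + H_0 - \lambda_g(0)^T u(0) + \sum_{i=1}^{N} \left( H_i - \lambda_f(i)^T x(i) - \lambda_g(i)^T u(i) \right) \right].$$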

We now consider differential changes in $J$ due to changes in $\theta$, $\alpha$, $x(1), \ldots, x(N)$, $x(t_f)$, and
$u(0), \ldots, u(N)$. We require that admissible changes in $\alpha$ be small enough that
$\lfloor (t_f + dt_f)/\Delta t \rfloor = \lfloor t_f/\Delta t \rfloor$, thus allowing N to be treated as a constant. We then have

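$$dJ = E\Bigg[ \left( \frac{\partial \phi}{\partial x(t_f)} - \lambda_f(t_f)^T \right) dx(t_f) + \frac{\partial H_0}{\partial x(0)}\, dx(0) + \sum_{i=1}^{N} \left( \frac{\partial H_i}{\partial x(i)} - \lambda_f(i)^T \right) dx(i)$$

$$\qquad + \sum_{i=0}^{N} \left( \frac{\partial H_i}{\partial u(i)} - \lambda_g(i)^T \right) du(i) + \sum_{i=0}^{N} \frac{\partial H_i}{\partial \theta}\, d\theta + \left( \frac{\partial \phi}{\partial t_f}\Delta t + \frac{\partial H_N}{\partial \alpha} \right) d\alpha \Bigg].$$

Since x(0) is fixed for each trajectory, $dx(0) = 0$.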

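The function $t_f(x_0)$ can be implemented using a second neural network with weights $\vartheta$,

$$t_f = \tau(x(0),\vartheta),$$

which, given enough hidden nodes, can approximate the optimal time-to-go function to any required accuracy. The variation of
the step ratio $\alpha$ can now be written as a function of the variation of the network weights $\vartheta$:

$$\alpha = \frac{1}{\Delta t}\, t_f - N = \frac{1}{\Delta t}\, \tau(x_0,\vartheta) - N$$

$$d\alpha = \frac{1}{\Delta t} \frac{\partial \tau}{\partial \vartheta}\, d\vartheta.$$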
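Substituting this expression for $d\alpha$ back into $dJ$ yields

$$dJ = E\Bigg[ \left( \frac{\partial \phi}{\partial x(t_f)} - \lambda_f(t_f)^T \right) dx(t_f) + \sum_{i=1}^{N} \left( \frac{\partial H_i}{\partial x(i)} - \lambda_f(i)^T \right) dx(i) + \sum_{i=0}^{N} \left( \frac{\partial H_i}{\partial u(i)} - \lambda_g(i)^T \right) du(i)$$

$$\qquad + \sum_{i=0}^{N} \frac{\partial H_i}{\partial \theta}\, d\theta + \left( \frac{\partial \phi}{\partial t_f} + \frac{1}{\Delta t} \frac{\partial H_N}{\partial \alpha} \right) \frac{\partial \tau}{\partial \vartheta}\, d\vartheta \Bigg].$$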


In order to have optimal x(i) and u(i), we must have $\nabla J = 0$, which requires that along every
trajectory $dJ = 0$ for all choices of $d\theta$, $d\vartheta$, $dx(1), \ldots, dx(N)$, $dx(t_f)$, and $du(0), \ldots, du(N)$.
Therefore,


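$$\lambda_f(t_f)^T = \frac{\partial \phi}{\partial x(t_f)}, \quad \forall x_0 \in \{x_0^1,\ldots,x_0^P\} \tag{17}$$

$$\lambda_f(i)^T = \frac{\partial H_i}{\partial x(i)}, \quad i = 1,\ldots,N, \quad \forall x_0 \in \{x_0^1,\ldots,x_0^P\} \tag{18}$$

$$\lambda_g(i)^T = \frac{\partial H_i}{\partial u(i)}, \quad i = 0,\ldots,N, \quad \forall x_0 \in \{x_0^1,\ldots,x_0^P\} \tag{19}$$

$$0 = E\left[ \sum_{i=0}^{N} \frac{\partial H_i}{\partial \theta} \right] \tag{20}$$

$$0 = E\left[ \left( \frac{\partial \phi}{\partial t_f} + \frac{1}{\Delta t} \frac{\partial H_N}{\partial \alpha} \right) \frac{\partial t_f}{\partial \vartheta} \right] \tag{21}$$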


The terminal condition on $\lambda_f$ is given by equation 17. It is identical to the expression found for
N-stage optimal control (eqn. 3). We can expand the equations for the adjoint vector (eqns. 18-19).
For $i = N$ we have




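$$\lambda_g(N)^T = \alpha \left( \frac{\partial L_N}{\partial u(N)} + \lambda_f(t_f)^T \frac{\partial f_N}{\partial u(N)} \right) \tag{22}$$

$$\lambda_f(N)^T = \alpha \left( \frac{\partial L_N}{\partial x(N)} + \lambda_f(t_f)^T \frac{\partial f_N}{\partial x(N)} \right) + (1-\alpha)\,\lambda_f(t_f)^T + \lambda_g(N)^T \frac{\partial g_N}{\partial x(N)} \tag{23}$$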


These expressions sweep the terminal error $\lambda_f(t_f)$ back through the simulated final step. For
$i = 0, \ldots, N-1$ equations 18-19 become


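$$\lambda_g(i)^T = \frac{\partial L_i}{\partial u(i)} + \lambda_f(i+1)^T \frac{\partial f_i}{\partial u(i)} \tag{24}$$

$$\lambda_f(i)^T = \frac{\partial L_i}{\partial x(i)} + \lambda_f(i+1)^T \frac{\partial f_i}{\partial x(i)} + \lambda_g(i)^T \frac{\partial g_i}{\partial x(i)} \tag{25}$$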


These expressions sweep the error at time step N back through the iterations of the feedback loop. 
Note that the backward sweep equation for the neural network structure (eqn. 25) and for optimal 
control (eqn. 4) differ only in the term $\partial g_i/\partial x(i)$. This term appears in equation 25 because
in a feedback control loop, the u(i) are determined by x(i). The optimality condition given by
equation 20 is analogous to equation 5 derived for N-stage optimal control. In optimal control, 
however, the gradient is found with respect to each of the control vectors and gradient descent is 
directly employed on each of these vectors separately (eqn. 6). Here, the average weight gradient 
over the trajectory is used since the same control weights must be used at each stage. In practice, 
a weight update is made after each forward/backward sweep using the instantaneous gradient: 

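$$\theta_{k+1} = \theta_k - \mu_\theta \sum_{i=0}^{N} \left( \lambda_g(i)^T \frac{\partial g_i}{\partial \theta} \right)^T. \tag{26}$$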

To analyze this stochastic gradient descent over time, we consider the total weight change after K 
random trajectories: 


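$$\Delta\theta = -\mu_\theta \sum_{k=1}^{K} \sum_{i=0}^{N} \left( \lambda_g(i)^T \frac{\partial g_i}{\partial \theta_k} \right)^T.$$

If $\mu_\theta$ is small, we can assume $\theta_1 \approx \theta_2 \approx \cdots \approx \theta_K$. Furthermore, the probability that $x_0^p$ is chosen
on the kth iteration is $P(X_0 = x_0^p)$. Thus the summation converges to

$$\Delta\theta \approx -\mu_\theta K\, E\left[ \sum_{i=0}^{N} \left( \lambda_g(i)^T \frac{\partial g_i}{\partial \theta} \right)^T \right].$$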

We see that, as with the LMS algorithm, the slow adaptation process smooths the weight-gradient 
estimate so that the change in the weight vector follows the true gradient on average [14].


So far this derivation has produced an algorithm exactly like BPTT with an added partial 
final-step. The critical new term, however, is the transversality condition (eqn. 21) which can be 
expanded to give 

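$$0 = E\left[ \lambda_\tau \frac{\partial \tau}{\partial \vartheta} \right] \tag{27}$$

where

$$\lambda_\tau = \frac{\partial \phi}{\partial t_f} + \frac{1}{\Delta t} L[x(N),u(N)] + \lambda_f(t_f)^T \left( \frac{f[x(N),u(N)] - x(N)}{\Delta t} \right) \tag{28}$$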


will be called the final-time error. Equation 28 is a discrete-time approximation of equation 7.
To improve the final-time, we can compute $\lambda_\tau$ after each sweep, backpropagate it through the
network $\tau$, and update the time-to-go network weights based on the resulting instantaneous gradient
estimate:

$$\vartheta_{k+1} = \vartheta_k - \mu_\vartheta \left( \lambda_\tau \frac{\partial \tau}{\partial \vartheta} \right)^T. \tag{29}$$


To gain some insight into the transversality condition, consider the three terms of the final-time
error (eqn. 28). The first term, $\partial\phi/\partial t_f$, is the direct effect on the terminal cost of varying $t_f$. The
second term, $L[x(N),u(N)]/\Delta t$, is an incremental trajectory cost. The last term can be rewritten as

$$\lambda_f(t_f)^T\, \frac{f[x(N),u(N)] - x(N)}{\Delta t} \approx \frac{\partial \phi}{\partial x(t_f)} \left. \frac{dx}{dt} \right|_{t=t_f}.$$
This is the indirect effect on the terminal error due to the terminal state changing. In summary,
the algorithm uses the network $\tau$ to predict the stopping time. The final-state error (eqn. 17) for
the given prediction of $t_f$ is backpropagated through the stages of the feedback loop to update the
controller network weights and the final-time error (eqn. 28) is backpropagated once through $\tau$ to
update the time-to-go network weights.

4.3.2 Method 2: Implicit Determination of “Time-To-Go” 

Although the optimization problem consists of finding both the optimal $\theta$ and $N(x_0)$, once training
is complete it is often only necessary to store $\theta$. The evolution of the actual physical control system
will typically provide a natural stopping point. In training, however, some method must still be 
used to determine at what time-step to stop the forward run and compute the terminal error. In 
this second approach, we assume that once the controller is trained, the time-to-go mapping is no 
longer necessary and thus it is not explicitly stored. This, as we shall see from the experimental 
results, yields a simpler and more robust algorithm. 

In order to optimize the choice of time-to-go function $N(x_0)$, we begin by reconsidering the
general cost function (eqn. 9) and distributing the minimization over N across the expected value

$$J^{\circ} = \min_{x(x_0),\,u(x_0),\,\theta,\,N(x_0)} \sum_{p=1}^{P} J\left( x(x_0^p), u(x_0^p), N(x_0^p) \right) P(X_0 = x_0^p)$$

$$\quad = \min_{x(x_0),\,u(x_0),\,\theta} \sum_{p=1}^{P} \left[ \min_{N} J\left( x(x_0^p), u(x_0^p), N \right) \right] P(X_0 = x_0^p).$$

We are now minimizing the cost function over $N \in Z^+$ for a specific trajectory and choice of $\theta$.
The distribution is allowed since the original minimization on $N(x_0)$ was taken over $\Re^n \times Z^+$. The
minimizing value of N within the expected value will depend on the choice of $x_0$ and $\theta$ and is
defined as

$$N^{\circ} = N^{\circ}(x_0,\theta) = \arg\min_{N} J(x(x_0), u(x_0), N).$$

Note that this is not the desired function $N^{\circ}(x_0)$ unless we have optimal $\theta$. Assuming for the
moment that $N^{\circ}$ is known, we can substitute it back into the expression for $J$:

$$J^{\circ} = \min_{x(x_0),\,u(x_0),\,\theta} \sum_{p=1}^{P} J\left( x(x_0^p), u(x_0^p), N^{\circ} \right) P(X_0 = x_0^p).$$

Effectively, we have projected the problem of optimizing over $x$, $u$, $\theta$ and $N(x_0)$ into a problem of
only optimizing over $x$, $u$, and $\theta$. We will come back to how to determine $N^{\circ}$.

First we consider how to find the optimal $\theta$ by again deriving a set of necessary first-order
stationary conditions. As before, we seek the $x$, $u$, $\theta$, and $N(x_0)$ which minimize

$$E\left[ \phi[x(N(x_0)), N(x_0)\Delta t] + \sum_{i=0}^{N(x_0)-1} L[x(i),u(i)] \right]$$
subject to the constraints given by equations 10-12. Substituting in the as yet unknown $N^{\circ}$,
we can write

$$J = E\left[ \phi[x(N^{\circ}), N^{\circ}\Delta t] + \sum_{i=0}^{N^{\circ}-1} L[x(i),u(i)] \right]. \tag{31}$$


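We adjoin the system constraints to the cost function using two Lagrange multiplier sequences, $\lambda_f$
and $\lambda_g$, as before:

$$J = E\left[ \phi[x(N^{\circ}), N^{\circ}\Delta t] + \sum_{i=0}^{N^{\circ}-1} \left\{ L[x(i),u(i)] + \lambda_f(i+1)^T \left( f_i - x(i+1) \right) + \lambda_g(i)^T \left( g_i - u(i) \right) \right\} \right].$$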

We substitute the Hamiltonian sequence,

$$H_i = L[x(i),u(i)] + \lambda_f(i+1)^T f(x(i),u(i)) + \lambda_g(i)^T g(x(i),\theta), \quad i = 0,\ldots,N^{\circ}-1,$$

into $J$ and rearrange the terms to produce


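$$J = E\left[ \phi[x(N^{\circ}), N^{\circ}\Delta t] - \lambda_f(N^{\circ})^T x(N^{\circ}) + H_0 - \lambda_g(0)^T u(0) + \sum_{i=1}^{N^{\circ}-1} \left( H_i - \lambda_f(i)^T x(i) - \lambda_g(i)^T u(i) \right) \right].$$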


Consider differential changes in $J$ due to changes in $\theta$, $x(1), \ldots, x(N^{\circ})$, and $u(0), \ldots, u(N^{\circ}-1)$. We
require that admissible $\delta\theta$ are chosen small enough that $N^{\circ}(x_0,\theta) = N^{\circ}(x_0,\theta + \delta\theta)$, thus treating
$N^{\circ}$ as a constant:


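$$dJ = E\Bigg[ \left( \frac{\partial \phi}{\partial x(N^{\circ})} - \lambda_f(N^{\circ})^T \right) dx(N^{\circ}) + \frac{\partial H_0}{\partial x(0)}\, dx(0) + \sum_{i=1}^{N^{\circ}-1} \left( \frac{\partial H_i}{\partial x(i)} - \lambda_f(i)^T \right) dx(i)$$

$$\qquad + \sum_{i=0}^{N^{\circ}-1} \left( \frac{\partial H_i}{\partial u(i)} - \lambda_g(i)^T \right) du(i) + \sum_{i=0}^{N^{\circ}-1} \frac{\partial H_i}{\partial \theta}\, d\theta \Bigg].$$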


Again, we know that $dx(0) = 0$. In order to have optimal $\theta$, x(i) and u(i), we require that $dJ = 0$
for all choices of $d\theta$, $dx(1), \ldots, dx(N^{\circ})$, and $du(0), \ldots, du(N^{\circ}-1)$ along every trajectory, thus
giving


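$$\lambda_f(N^{\circ})^T = \frac{\partial \phi}{\partial x(N^{\circ})}, \quad \forall x_0 \in \{x_0^1,\ldots,x_0^P\} \tag{32}$$

$$\lambda_f(i)^T = \frac{\partial H_i}{\partial x(i)} = \frac{\partial L_i}{\partial x(i)} + \lambda_f(i+1)^T \frac{\partial f_i}{\partial x(i)} + \lambda_g(i)^T \frac{\partial g_i}{\partial x(i)}, \quad i = 1,\ldots,N^{\circ}-1, \quad \forall x_0 \in \{x_0^1,\ldots,x_0^P\} \tag{33}$$

$$\lambda_g(i)^T = \frac{\partial H_i}{\partial u(i)} = \frac{\partial L_i}{\partial u(i)} + \lambda_f(i+1)^T \frac{\partial f_i}{\partial u(i)}, \quad i = 0,\ldots,N^{\circ}-1, \quad \forall x_0 \in \{x_0^1,\ldots,x_0^P\} \tag{34}$$

$$0 = E\left[ \sum_{i=0}^{N^{\circ}-1} \frac{\partial H_i}{\partial \theta} \right] = E\left[ \sum_{i=0}^{N^{\circ}-1} \lambda_g(i)^T \frac{\partial g_i}{\partial \theta} \right] \tag{35}$$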


Equations 32-34 are the same terminal condition and backward sweep equations as equation 17 
and equations 24-25. The optimality condition (eqn. 35) is again satisfied using stochastic gradient 
descent: 


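$$\theta_{k+1} = \theta_k - \mu_\theta \sum_{i=0}^{N^{\circ}-1} \left( \lambda_g(i)^T \frac{\partial g_i}{\partial \theta} \right)^T. \tag{36}$$

The extension beyond BPTT in this algorithm is the new expression

$$N^{\circ} = \arg\min_{N} \left( \phi[x(N), N\Delta t] + \sum_{i=0}^{N-1} L[x(i), g(x(i),\theta)] \right) \tag{37}$$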


which takes the place of the transversality condition of method 1 (eqn. 27). This expression is
not in the form of a stationary condition but is rather in the form of an explicit expression for
the minimum value. Application of this second method is straightforward: we simply stop the forward
run at the time-step which minimizes the total cost function for the current value of $\theta$. Note that
this is not necessarily the time-step where the state is closest to the terminal position. We then
use equations 33-34 to propagate the error at that time-step through the control feedback loop.
In practice we assume that we can find some upper bound on $N^{\circ}$, $N_{max}$. The forward sweep is
terminated after $N_{max}$ iterations, $J(x(x_0), u(x_0), N)$ is computed for each $N = 0, \ldots, N_{max}$, and
the time-step with minimum value of $J$ is chosen as $N^{\circ}$ for that trajectory.
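The stopping rule of method 2 amounts to a one-dimensional search along the forward run. The sketch below is one possible way to write it, with the plant `f`, controller `g`, terminal cost `phi`, and trajectory cost `L` passed in as ordinary Python callables; these names, and the toy scalar problem in the usage example, are assumptions made for illustration only.

```python
def best_stopping_step(x0, f, g, theta, phi, L, dt, n_max):
    """Return (N, J) minimizing J(N) = phi(x(N), N*dt) + sum_{i<N} L(x(i), u(i))."""
    x = x0
    running = 0.0                          # accumulated trajectory cost so far
    best_n, best_cost = 0, phi(x0, 0.0)    # candidate N = 0: stop immediately
    for n in range(1, n_max + 1):
        u = g(x, theta)                    # state-feedback control u(n-1)
        running += L(x, u)                 # add L[x(n-1), u(n-1)]
        x = f(x, u)                        # plant update gives x(n)
        cost = phi(x, n * dt) + running    # total cost if the run stopped at step n
        if cost < best_cost:
            best_n, best_cost = n, cost
    return best_n, best_cost


if __name__ == "__main__":
    # Toy problem: move toward the origin one unit per step, pay for time
    # plus squared final distance.
    n_opt, j_opt = best_stopping_step(
        3.6,
        f=lambda x, u: x + u,
        g=lambda x, theta: -theta,
        theta=1.0,
        phi=lambda x, t: t + x * x,
        L=lambda x, u: 0.0,
        dt=1.0,
        n_max=10,
    )
    print(n_opt, round(j_opt, 2))
```

In this toy run the state is closest to the origin after four steps, yet the total cost is lowest after three, which illustrates the remark that the minimizing time-step need not be the one nearest the terminal position.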


4.4 Comparison of Optimal Control Formulation to BPTT 


It is instructive to compare the Lagrange multiplier equations just derived with the calculations car- 
ried out in BPTT. First consider equation 32. If we used a quadratic soft terminal-state constraint 
(eqn. 14), we can evaluate the terminal value of the Lagrange multiplier sequence as 


$$\lambda_f(N)^T = \frac{\partial \phi}{\partial x(N)} = \frac{\partial \left( (x_d - x(N))^T Q_x (x_d - x(N)) \right)}{\partial x(N)} = -2\,(x_d - x(N))^T Q_x.$$
This is the same scaled state-error used in BPTT.

Next consider the backward equations (eqns. 33-34). Figure 2 shows a graphical representation 
of the various components of these equations for one time-step. The thin lines are the signal flow 
for the state and control vectors. The bold lines are the signal flow for the adjoint vectors. We 
can interpret $\lambda_f(i)$ to be the squared-error derivative at the plant output at the ith stage and $\lambda_g(i)$
to be the squared-error derivative at the controller output. With these interpretations, we see
that the adjoint vector equations exactly describe the BPTT process and that the $\lambda$ are, in fact,
the backpropagated quantities. The term $\lambda_g(i)^T\, \partial g_i/\partial x(i)$ is computed in BPTT by propagating
the error $\lambda_g(i)$ through the controller network using the backpropagation algorithm described by
Rumelhart [15] with internal activation values determined by the forward sweep of x(i). This
backpropagation also computes the weight gradient component $\lambda_g(i)^T\, \partial g_i/\partial \theta$. The error components
$\lambda_f(i+1)^T\, \partial f_i/\partial u(i)$ and $\lambda_f(i+1)^T\, \partial f_i/\partial x(i)$ can be computed in several ways. If $f$ is a neural
network model of the plant then backpropagation can be used, as in the work of Nguyen and Widrow
[5]. If the equations of the plant are known, then the Jacobian matrices $f_u(i) = \partial f_i/\partial u(i)$ and
$f_x(i) = \partial f_i/\partial x(i)$ can be computed analytically. Alternatively, these matrices can be estimated
numerically by perturbing the inputs to $f$ and observing the output perturbations. The Jacobian
matrices are then directly multiplied by $\lambda_f(i+1)$.
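The numerical alternative mentioned above can be sketched as follows. Central differences are used here as one reasonable perturbation scheme; the paper does not say which scheme was used, and the function names and step size are hypothetical.

```python
def numerical_jacobian(func, z, eps=1e-6):
    """Estimate the Jacobian of func at z by central differences.

    func maps a list of floats to a list of floats; entry [i][j] of the
    result approximates d func_i / d z_j.
    """
    n_out = len(func(z))
    jac = [[0.0] * len(z) for _ in range(n_out)]
    for j in range(len(z)):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[j] += eps                   # perturb one input up
        z_minus[j] -= eps                  # and down
        f_plus = func(z_plus)
        f_minus = func(z_minus)
        for i in range(n_out):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2.0 * eps)
    return jac


def adjoint_times_jacobian(lam, jac):
    """Row-vector product lam^T * jac, i.e. one backward-sweep error term."""
    return [sum(lam[i] * jac[i][j] for i in range(len(lam)))
            for j in range(len(jac[0]))]


if __name__ == "__main__":
    plant = lambda z: [z[0] * z[1], z[0] + 2.0 * z[1]]   # toy two-output map
    jac = numerical_jacobian(plant, [2.0, 3.0])
    print(adjoint_times_jacobian([1.0, 1.0], jac))
```

To obtain the state and control Jacobians separately, the same routine can be applied to the plant with the control held fixed and then with the state held fixed.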



Figure 2: Forward and backward signal paths through one stage.

5 Experimental Results 

This section presents the results of applying the time-optimal techniques to a simple optimal control 
problem. This Zermelo problem, proposed by Bryson [1], consists of a boat navigating in a river 
with a linear current profile as shown in figure 3. The controller is required to steer the boat to
the goal at the center of the river in minimum time from many initial positions. This system was 
chosen because it is simple enough to allow the family of optimal trajectories to be visualized and 
yet provides an interesting time-minimization problem. 


Figure 3: Control of a boat navigating through a river with linear current profile.

The boat state-vector was $[x, y, \theta_r]$, where (x,y) was the position of the bow of the boat with
respect to the goal and $\theta_r = \theta - \tan^{-1}(y/x)$ was the direction of motion of the boat relative to the
direction to the goal. This choice of relative angle instead of absolute angle prevented the desired
direction from being a discontinuous function of position along the radial $\{x, y : \tan^{-1}(y/x) = \pi\}$.
The control vector was the change in boat direction over a sampling interval, $\Delta\theta$. The velocity of
the boat was a constant with respect to the water, V = 1, and the current profile was chosen to be
$V_r = -y/25$. This gave the plant equations

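$$x(i+1) = x(i) + \cos\left( \theta_r(i) + \tan^{-1}\frac{y(i)}{x(i)} \right) - \frac{y(i)}{25}$$

$$y(i+1) = y(i) + \sin\left( \theta_r(i) + \tan^{-1}\frac{y(i)}{x(i)} \right)$$

$$\theta_r(i+1) = \theta_r(i) + u(i), \quad \text{normalized to } [-\pi, \pi].$$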
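For readers who want to experiment, one stage of this plant might be coded as below. Several details are assumptions rather than facts from the paper: the current term is taken as −y/25 added to the x update, the inverse tangent is implemented with the four-quadrant `math.atan2`, a unit sampling interval is implied, and the angle is wrapped into the half-open interval [−π, π). The function name is hypothetical.

```python
import math


def boat_step(x, y, theta_r, u):
    """One step of the boat plant (illustrative reading of the plant equations)."""
    heading = theta_r + math.atan2(y, x)          # absolute direction of motion
    x_next = x + math.cos(heading) - y / 25.0     # unit boat speed plus assumed current term
    y_next = y + math.sin(heading)
    theta_next = theta_r + u                      # control changes the relative angle
    # Wrap the angle into [-pi, pi).
    theta_next = (theta_next + math.pi) % (2.0 * math.pi) - math.pi
    return x_next, y_next, theta_next


if __name__ == "__main__":
    print(boat_step(1.0, 0.0, 0.0, 0.0))
```

Passing the controller output as `u` and calling `boat_step` repeatedly from a sampled initial state gives the forward run needed by either training method.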

The initial state for each run, $(x_0, y_0, \theta_{r,0})$, was chosen from a uniform distribution over
$\{x, y, \theta_r : x \in [-50,50],\ y \in [-50,50],\ \theta_r \in [-\pi/4, \pi/4]\}$.

The cost function consisted of a soft quadratic end-constraint and a time minimization term: 

$$J = E\left[ N^{\circ}\Delta t + x(N^{\circ})^2 + y(N^{\circ})^2 \right].$$

The controller and time-to-go neural networks each had 3 input nodes, 10 hidden nodes, and 1 
output node. The number of hidden nodes was chosen by evaluating performance with different 
network configurations and picking the smallest network with satisfactory performance. 
Before training began, the controller weights were initialized to random values. A few sample
trajectories for the untrained controller are shown in figure 4. These and all subsequent trajec- 
tories are plotted against paths derived by optimal control methods for each of the initial states 
separately.³ The controller was then trained to emulate a coarse control law, thus providing an
initial, rough guess of the weights. This coarse controller simply pointed the boat towards the goal 
at all times. Sample trajectories after this weight initialization training are shown in figure 5. No- 
tice that the paths are not close to the optimal trajectories. This pre-training was done in order to 
keep the subsequent, on-line learning process stable. This weight initialization scheme was chosen 
as an alternative to the procedure used by Nguyen and Widrow [5] which required a scheduling of 
the initial states presented to the system. Their procedure adapted the controller based on many 
initial states whose N°(xq) were small before initial states with larger N°(xq) were attempted. 

The two extended BPTT algorithms were then used to further train the control system to
minimize the above cost function. Trajectories for method 1 after about 150,000 training cycles
with learning rates of $10^{-3}$ and $10^{-4}$ are shown in figure 6. Figure 7 shows the same trajectories for
method 2 after about 30,000 training cycles with $\mu_\theta = 10^{-2}$. Larger values of $\mu$ caused learning
instabilities. Although the trajectories were not identical to the optimal control solutions, they were
very close. In figure 7, all of the trajectory times are within $\Delta t$ of the optimal control solutions. In
figure 6 the times differ by at most $3\Delta t$. Algorithm 1 was found empirically to require smaller $\mu$
than algorithm 2 to prevent the learning process from diverging. Also, the paths in algorithm 1 were
not quite as close to the optimal continuous solutions. One possible reason for this was that errors
in the time-prediction mapping may have caused artificial errors in the control mapping. Since the
time-to-go is not explicitly represented in algorithm 2, it does not suffer from this problem.


6 Conclusions 

This paper was written with two goals. The first was the presentation of two algorithms which 
extend backpropagation through time (BPTT) to terminal control problems with unknown final 
time. One algorithm that uses an auxiliary network to explicitly predict the optimal final-time is 
of theoretical interest because it is a direct extension of optimal control techniques, specifically, the 
optimal control method which satisfies a transversality condition through gradient descent on $t_f$.
A second algorithm stops runs at the time-step which minimizes the total cost function, including 
the time-minimization term. This algorithm was found to be less sensitive to the learning rate, less 
likely to diverge, and easier to implement than was the first algorithm. Because of this, use of the 
second algorithm is recommended unless the time-to-go mapping is required. Even in situations 
which do not require final-time minimization, it is still necessary to decide upon a stopping point 
for the forward run. The choice of N should not be made indiscriminately, as this choice will have 
a direct effect on the controller weights. For example, a stopping heuristic which tends to pick 
small N might result in a controller which uses larger control effort than otherwise. Because of
this, algorithm 2 should be used to stop the run at the iteration which minimizes the required cost
function, even if the cost does not involve trajectory-time.

Footnote 3: Recall from section 3 that the optimal control formulation is not valid for discrete-time controllers with open
final-time. Thus, comparisons were made using a continuous plant and Bryson's fcnopt routine [13].

The second goal of this paper was the demonstration of the relationship between classical 
optimal control methods and BPTT. BPTT, with the open final-time extensions presented here, can 
be directly derived using standard optimal control methods, and the propagated errors in BPTT 
are equivalent to the Lagrange multipliers in optimal control. A realization of the similarities 
between optimal control and BPTT will allow the application of optimal control techniques to 
neural networks, while, conversely, the ability of neural networks to realize a wide class of nonlinear 
functions will permit them to solve problems in classical optimal control that might otherwise have 
been difficult. 
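As an illustration of this equivalence, consider a minimal sketch for a fixed number of steps $N$, a terminal cost $\phi$, a plant $x(i+1) = f(x(i), u(i))$, and a state-feedback controller $u(i) = g(x(i); \theta)$. The symbols $f$, $g$, $\phi$, and $\lambda$ are chosen here for illustration and may differ from the notation of the earlier sections. Adjoining the plant equations to the cost with multipliers $\lambda(i)$ gives

$$\bar{J} = \phi(x(N)) + \sum_{i=0}^{N-1} \lambda(i+1)^T \left[ f\big(x(i), g(x(i); \theta)\big) - x(i+1) \right].$$

Requiring $\partial \bar{J} / \partial x(i) = 0$ for $i = 1, \ldots, N$ yields the backward sweep

$$\lambda(N) = \left(\frac{\partial \phi}{\partial x(N)}\right)^T, \qquad \lambda(i) = \left(\frac{\partial f}{\partial x} + \frac{\partial f}{\partial u}\frac{\partial g}{\partial x}\right)^T \lambda(i+1),$$

with the partial derivatives evaluated at step $i$, and the gradient with respect to the weights (written as a column vector) is

$$\left(\frac{\partial \bar{J}}{\partial \theta}\right)^T = \sum_{i=0}^{N-1} \left(\frac{\partial f}{\partial u}\frac{\partial g}{\partial \theta}\right)^T \lambda(i+1).$$

The recursion for $\lambda(i)$ is the same computation that BPTT performs when it propagates errors backward through the unrolled plant and controller, which is the sense in which the propagated errors play the role of Lagrange multipliers.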

The use of a multi-layer neural network as a general tool for synthesizing an optimal state-feedback
terminal controller depends on certain assumptions. First, by definition, the neural network
realizes a continuous mapping from state to control. In terminal control problems, the desired
mapping is not necessarily continuous. For example, in the boat problem above, if the direction
had not been chosen as a relative angle, there would have been a discontinuity. Although a network
can approximate a function with discontinuities, I have found that, in practice, this makes it
difficult to obtain convergence for terminal controller problems. Second, due to the structure of the
network, the range of state space over which the network attempts to learn the optimal control law
must be restricted. Third, the problem is assumed to be stationary, so the optimal cost function,
$J^o$, does not depend explicitly on time. Thus the controller weights $\theta$ are not dependent on time
either. Finally, and perhaps most importantly, the methods described assume full state feedback
is possible.

There are a number of issues open to future research. Necessary first-order stationarity conditions
have been examined, but an investigation of sufficient conditions for local minima with regard to 
conjugate and focal points [1] is still needed. Furthermore, the neural network control scheme 
presented here relies on full state information. In situations where the controller does not have 
this information, some form of state estimation must be used. This is an issue that needs to be 
addressed in the context of neural network terminal controllers. 

Figure 5: Trajectories for 3x10x1 neural controller after rough weight initialization. (Horizontal axis: position upstream, x.)

Figure 6: Trajectories for 3x10x1 neural controller trained using method 1. (Horizontal axis: position upstream, x.)

Figure 7: Trajectories for 3x10x1 neural controller trained using method 2. (Horizontal axis: position upstream, x.)